Build a regressor that recommends the county and state that should be considered for future acquisitions based on Acreage > 180, Number of Unique Properties Available, and Price per Acre. If you believe that there are additional or better variables to use to make the determination please include them.
High quality land that has less overhead and does not proportionately benefit as much from economies of scale (enough resources are available, such as water, that expenses to source water - for example - are minimal and therefore, sourcing large volumes of it from long distances is not needed for sustainability - thus disproving the need for economies of scale) is more likely to be competitive and have more ownership segmentation (more properties per acre in a valuable area). For example, a greenhouse farm in the desert would benefit from economies of scale and would not likely be a great place for investment compared to fertile farmland.
Without having more insight into the types of crops and other metrics determining yield, profitability and expenses such as taxes, the model is limited for finding optimized farm investment locations. Given the missingness in the data, no other features beyond acreage, number of unique properties, and price per acre were used at this time. Additional data would help fine tune this model and is something I would work to get if employed at AcreTrader.
The approach leverages a price optimization equation based on linear regression to make a recommendation for County/State combinations. The County and State were combined since the counties are within the states. If this was for a longitude-latitude assessment, a separate model would be constructed to predict ideal longitude and another for ideal latitude, with residual errors from each model combined to assess the total error. As additional terms would be added to each model under the longitude-latitude scenario, combined error between models would be compared, with the lowest aggregated error representing the best model. The extra sum of squares approach would also be useful for comparing additional linear models for this objective.
The underlying business assumption of this model is that scarcity of land drives real estate prices higher, possibly as a result of an abundance of resources. Therefore, this model is based on finding demand and assumes that investors will rush to concentrated areas in an attempt to purchase land, increasing both property per acre as well as price per acre, which this model finds, to not be a false assumption (null hypothesis rejected; p-value < 0.0001, 95% confidence interval [58.227, 91.572]). Using a starting intercept for price per acre of 3.22 (p-value < 0.0001, 95% confidence interval [3.168, 3.268]), price scales linearly by 74.9 for each additional unit of property per acre that is added. In other words, as more properties per acre exist, the price increases, appreciating the investment. Therefore, it would be ideal to invest in an area with the most properties per acre as this is a good sign of a successful farming community, pointing to good resources such as supplies, distribution routes, and markets, among others.
Because of the lack of more data, the model is relatively imprecise (low r-squared value). However, the relationship between properties_per_acre and price_per_acre is reliable. The low r-squared value means there is a reasonably large amount of error around the regression line and the farther from the line, the less reliably the model. Adding additional terms to the model may improve this fit. For example, agricultural revenue per county, crops per county interacting with revenue yield per crop type per unit, or tax rates for a cost component. This could result in more fluctuation between counties and therefore, a more dynamic relationship.
This model identified Burt County, Nebraska as the best place to invest in a farm. Burt County is on the eastern edge of the Ogallala Aquifer and western edge (sometimes on both edges) of the Missouri River, both likely providing mineral deposits into the soil as well as crop irrigation supply. Burt County also contains four major highways - highways 75, 77, 51, and 32 - is across the Missouri River from Interstate Highway 29, and is 54 miles from Omaha, Nebraska, where the Union Pacific Railroad - the largest railroad in the United States - is headquartered. Omaha is considered a major domestic transportation hub, nicknamed the "Gate City of the West." Therefore, proximity to the city is a major benefit to commerce. However, important to note is that there is a risk of flooding in Burt County; the Missouri River shifts here and snow melt is also a concern, in addition to rain. Major flooding occurred most recently in 1993, 2008, 2015, and 2019.
There is one 180-acre property in Burt County for $1,084,500.00. It has the highest value of properties per acre. At this point, time series data about this property could be useful for building a forecasting model to learn more about agriculture in this county, such as historic patterns and trends. This could provide more understanding about the property and what could be expected of it in the next few years.
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings('ignore')
original_data = pd.read_csv('/home/pablo/Desktop/AcreTrader/farms.csv')
original_data.shape
df = original_data[original_data['acres']>180]
df.shape
df['price'] = df['price'].str.replace('$', '').astype(float);
profile = ProfileReport(df, title="Pandas Profiling Report")
profile
df2 = df[['acres', 'price', 'state', 'county', 'acres', 'price_per_acre', 'address']]
There are missing addresses so replacing them all with integers
j=1
address_id = []
for i in range(0, len(df2)):
address_id.append(j)
j+=1
df2['address'] = address_id
df2['county_state'] = df['county'] + '_' + df['state']
unique_properties = df2.groupby(by='county_state', as_index=False).agg({'address': pd.Series.nunique})
unique_properties.columns = ['county_state','unique_properties']
unique_properties.head()
unique_properties.shape
df3 = df2.iloc[:,1:8].drop(['address', 'state','county', 'price_per_acre'], axis=1) # dropping duped acres
df4 = df3.groupby(by='county_state', as_index=False).agg({'price': pd.Series.sum, 'acres':pd.Series.sum})
df5 = df4.merge(unique_properties, on='county_state')
df5['price_per_acre'] = df5['price']/df5['acres']
df5['properties_per_acre'] = df5['unique_properties']/df5['acres']
df6 = df5.drop(['acres','price','unique_properties'], axis=1)
df_cat = df6.iloc[:,0]
df_num = df6.iloc[:,1:]
import matplotlib.pyplot as plt
dataset = pd.DataFrame(df_num)
dataset.hist(figsize=(10,4))
plt.show()
Normalize the price_per_acre variable:
df_num['price_per_acre'] = np.log10(df_num['price_per_acre'])
dataset = pd.DataFrame(df_num)
dataset.hist(figsize=(10,4))
plt.show()
df7 = pd.concat([df_cat, df_num], axis=1)
df7.head()
Below, I did not check Cook's D to see if the outlier below 1 price_per_acre (Jackson, IA) created enough leverage to throw off results of the test. Based on the scale, I assume it would not. Had the outlier been 0 and the intercept positioned at 10, for example, this may be worthwhile. Alternatively, if there had been more extreme outlying points in the plot below, Cook's D might also be worth checking.
import seaborn as sns
sns.lmplot(x = 'properties_per_acre', y = 'price_per_acre', data = df7, fit_reg = True, size = 4);
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
%matplotlib inline
# fit OLS model
model = ols("price_per_acre ~ properties_per_acre", data = df7).fit()
# print model summary
print(model.summary())
properties_per_acre = df7['properties_per_acre'].tolist()
b0 = model.params._values[0]
b1 = model.params._values[1]
profitability = []
for i in properties_per_acre:
scaled_price_per_acre = b0 - b1 * i # if other terms in model, include them here
profitability.append((i) * scaled_price_per_acre)
target_demand = pd.DataFrame({"Properties_Per_Acre": properties_per_acre, "Profitability": profitability})
target_demand[target_demand['Profitability'] == target_demand['Profitability'].max()]
best_investment = target_demand[target_demand['Profitability'] == target_demand['Profitability'].max()].iloc[0,0]
df7.head()
best_county_state = df7[df7['properties_per_acre'].round(6) == best_investment.round(6)].iloc[0,0]
df5[df5['county_state'] == best_county_state]
df5.sort_values(by=['properties_per_acre'], ascending=False)